2  Bivariate Viz

Use this file for practice with the bivariate viz in-class activity. Refer to the class website for details.

3 Examples: To Prepare for Class

# Import data
survey <- read.csv("https://ajohns24.github.io/data/112/about_us_2024.csv")

# How many students have now filled out the survey?

nrow(survey)
[1] 28
# What type of variables do we have?
str(survey)
'data.frame':   28 obs. of  4 variables:
 $ cafe_mac         : chr  "Cheesecake" "Cheese pizza" "udon noodles" "egg rolls" ...
 $ minutes_to_campus: int  15 10 4 7 5 35 5 15 7 20 ...
 $ fave_temp        : num  18 24 18 10 18 7 75 24 13 16 ...
 $ hangout          : chr  "the mountains" "a beach" "the mountains" "a beach" ...
# Attach a package needed to use the ggplot function
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Make a ggplot
ggplot(survey, aes(x = hangout)) +
  geom_bar(fill = "blue") + 
  theme_minimal()

# To understand temperature

ggplot(survey, aes(x = fave_temp)) +
  geom_bar(fill = "blue") + 
  theme_minimal()

ggplot(survey, aes(x = fave_temp)) +
  geom_histogram(color = "white", fill = "blue") +
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
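
The `stat_bin()` message above is a reminder to choose the bin width on purpose rather than accept the default of 30 bins. Here is a minimal sketch, assuming only the first ten `fave_temp` values printed in the `str()` output above; the name `first_ten` and the 5-degree width are illustrative choices, not part of the activity.

```r
library(ggplot2)

# First ten fave_temp values as printed by str(survey) above
# (`first_ten` is a name introduced for this sketch)
first_ten <- data.frame(fave_temp = c(18, 24, 18, 10, 18, 7, 75, 24, 13, 16))

# binwidth = 5 groups temperatures into 5-degree bins (an illustrative choice)
p_hist <- ggplot(first_ten, aes(x = fave_temp)) +
  geom_histogram(binwidth = 5, color = "white", fill = "blue") +
  theme_minimal()

p_hist
```

With the full data, the same `binwidth` argument can be added to the `survey` histogram; trying a few widths is a reasonable way to see which one shows the shape most clearly.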

4 In Class Exercises

# Load data
elections <- read.csv("https://mac-stat.github.io/data/election_2020_county.csv")

# Check it out
head(elections)
  state_name state_abbr historical    county_name county_fips total_votes_20
1    Alabama         AL        red Autauga County        1001          27770
2    Alabama         AL        red Baldwin County        1003         109679
3    Alabama         AL        red Barbour County        1005          10518
4    Alabama         AL        red    Bibb County        1007           9595
5    Alabama         AL        red  Blount County        1009          27588
6    Alabama         AL        red Bullock County        1011           4613
  repub_pct_20 dem_pct_20 winner_20 total_votes_16 repub_pct_16 dem_pct_16
1        71.44      27.02     repub          24661        73.44      23.96
2        76.17      22.41     repub          94090        77.35      19.57
3        53.45      45.79     repub          10390        52.27      46.66
4        78.43      20.70     repub           8748        76.97      21.42
5        89.57       9.57     repub          25384        89.85       8.47
6        24.84      74.70       dem           4701        24.23      75.09
  winner_16 total_votes_12 repub_pct_12 dem_pct_12 winner_12 total_population
1     repub          23909        72.63      26.58     repub            54907
2     repub          84988        77.39      21.57     repub           187114
3     repub          11459        48.34      51.25       dem            27321
4     repub           8391        73.07      26.22     repub            22754
5     repub          23980        86.49      12.35     repub            57623
6       dem           5318        23.51      76.31       dem            10746
  percent_white percent_black percent_asian percent_hispanic per_capita_income
1            76            18             1                2             24571
2            83             9             1                4             26766
3            46            46             0                5             16829
4            75            22             0                2             17427
5            88             1             0                8             20730
6            22            71             0                6             18628
  median_rent median_age
1         668       37.5
2         693       41.5
3         382       38.3
4         351       39.4
5         403       39.6
6         276       39.6

5 Exercise 0: Review

Part a. I would guess 65%?

ggplot(elections, aes(x = winner_20)) +
  geom_bar(fill = "blue") + 
  theme_minimal()

Part b.

ggplot(elections, aes(x = repub_pct_20)) +
  geom_histogram(fill = "blue") + 
  theme_minimal()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

6 Exercise 1: Quantitative vs Quantitative Intuition Check

ggplot(elections, aes(y = repub_pct_20, x = repub_pct_16)) +
  geom_point() + 
  geom_smooth() + 
  theme_minimal()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

7 Exercise 2: 2 Quantitative Variables

# Set up the plotting frame
# How does this differ from the frame for our histogram of repub_pct_20 alone? It has two axes, each mapped to a different variable
ggplot(elections, aes(y = repub_pct_20, x = repub_pct_16))

# Add a layer of points for each county
# Take note of the geom! --> geom_point, because there is one point for each county
ggplot(elections, aes(y = repub_pct_20, x = repub_pct_16)) +
  geom_point()

# Change the shape of the points
# What happens if you change the shape to another number? The points are drawn with a different symbol!
ggplot(elections, aes(y = repub_pct_20, x = repub_pct_16)) +
  geom_point(shape = 4)

# YOU TRY: Modify the code to make the points "orange"
# NOTE: Try to anticipate if "color" or "fill" will be useful here. Then try both. Answer: color worked!
ggplot(elections, aes(y = repub_pct_20, x = repub_pct_16)) +
  geom_point(color = "orange")

# Add a layer that represents each county by the state it's in
# Take note of the geom and the info it needs to run! geom_text replaces geom_point, and it needs a label aesthetic (here state_abbr) to know what text to draw
ggplot(elections, aes(y = repub_pct_20, x = repub_pct_16)) +
  geom_text(aes(label = state_abbr))

8 Exercise 3: Reflect

Several Texas counties appear to be the biggest outliers. Even so, the relationship looks very strong and positive: counties with higher Republican support in 2016 tended to have higher support in 2020.

9 Exercise 4: Visualizing trend

ggplot(elections, aes(y = repub_pct_20, x = repub_pct_16)) +
  geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Part a.

ggplot(elections, aes(y = repub_pct_20, x = repub_pct_16)) +
  geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Part b.

ggplot(elections, aes(y = repub_pct_20, x = repub_pct_16)) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
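
The `geom_smooth(method = "lm")` layer above draws a least-squares line using the formula `y ~ x`. To see the intercept and slope of such a line, the same model can be fit directly with `lm()`. This is a minimal sketch, assuming only the six Alabama counties printed by `head(elections)` above; `six_counties` is a name introduced here, and six rows are far too few to describe the national pattern.

```r
# First six counties from head(elections) above (all in Alabama)
six_counties <- data.frame(
  repub_pct_16 = c(73.44, 77.35, 52.27, 76.97, 89.85, 24.23),
  repub_pct_20 = c(71.44, 76.17, 53.45, 78.43, 89.57, 24.84)
)

# Same formula that geom_smooth(method = "lm") reports: y ~ x
fit <- lm(repub_pct_20 ~ repub_pct_16, data = six_counties)

# Intercept and slope of the fitted line
coef(fit)
```

With the full data loaded, replacing `six_counties` with `elections` should give the line drawn in the plot.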

10 Exercise 5: Your Turn

# Scatterplot of repub_pct_20 vs median_rent
ggplot(elections, aes(y = repub_pct_20, x = median_rent)) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

# Scatterplot of repub_pct_20 vs median_age
ggplot(elections, aes(y = repub_pct_20, x = median_age)) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

From these two scatterplots, median_rent appears to have a somewhat stronger linear relationship with repub_pct_20 than median_age does. The relationship with median_rent is negative: counties with lower rent tend to have a higher Republican percentage.
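
One way to put a number on "stronger" is the correlation coefficient, which is near -1 or 1 for a strong linear relationship and near 0 for a weak one. This is a minimal sketch, assuming only the six Alabama counties printed by `head(elections)` above; `six_counties` is a name introduced here, and results from six rows in one state should not be read as evidence about the full data.

```r
# First six counties from head(elections) above (all in Alabama)
six_counties <- data.frame(
  repub_pct_20 = c(71.44, 76.17, 53.45, 78.43, 89.57, 24.84),
  median_rent  = c(668, 693, 382, 351, 403, 276),
  median_age   = c(37.5, 41.5, 38.3, 39.4, 39.6, 39.6)
)

# Correlation of Republican support with each predictor
cor_rent <- cor(six_counties$repub_pct_20, six_counties$median_rent)
cor_age  <- cor(six_counties$repub_pct_20, six_counties$median_age)

c(median_rent = cor_rent, median_age = cor_age)
```

To compare the two predictors for real, the same `cor()` calls can be run on the `elections` columns. If any values are missing there, `cor()` returns `NA` unless an option such as `use = "complete.obs"` is supplied.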

11 Exercise 6: A Sad Scatterplot

A scatterplot is a poor choice for comparing a quantitative variable with a categorical one: the points stack up in a single column per category, which makes the distribution within each category hard to read.

ggplot(elections, aes(y = repub_pct_20, x = historical)) +
  geom_point()

12 Exercise 7: Quantitative vs Categorical - Violins & Boxes

# Side-by-side violin plots
ggplot(elections, aes(y = repub_pct_20, x = historical)) +
  geom_violin()

# Side-by-side boxplots (defined below)
ggplot(elections, aes(y = repub_pct_20, x = historical)) +
  geom_boxplot()

Reflect: County-level Republican support was higher on average in historically red states and lower in historically blue states, but the typical county was still above 50% in all three groups.
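
The center line of each boxplot is the group median, so a reflection like the one above can be checked numerically by computing the median within each category. This is a minimal sketch on a hypothetical toy data frame; the numbers in `toy_counties` are made up so the example runs on its own and say nothing about the real counties.

```r
# Hypothetical toy data, made up for illustration only
toy_counties <- data.frame(
  historical   = c("blue", "blue", "purple", "purple", "red", "red"),
  repub_pct_20 = c(48, 62, 55, 67, 70, 81)
)

# Median Republican support within each historical category
group_medians <- tapply(toy_counties$repub_pct_20, toy_counties$historical, median)
group_medians
```

Swapping in `elections$repub_pct_20` and `elections$historical` gives the medians that the boxplots display.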

13 Exercise 8: Quantitative vs Categorical - Intuition Check

ggplot(elections, aes(x = repub_pct_20, fill = historical)) +
  geom_density()

14 Exercise 9: Quantitative vs Categorical - Density Plots

# Name two "bad" things about this plot. (1) It is hard to see what is really happening. (2) It is hard to judge the relative scales.

ggplot(elections, aes(x = repub_pct_20, fill = historical)) +
  geom_density()

# What does scale_fill_manual do? It sets the fill color for each category's density curve

ggplot(elections, aes(x = repub_pct_20, fill = historical)) +
  geom_density() +
  scale_fill_manual(values = c("blue", "purple", "red"))

# What does alpha = 0.5 do? It sets the transparency of the fill (0 = fully transparent, 1 = fully opaque)

# Play around with different values of alpha, between 0 and 1
ggplot(elections, aes(x = repub_pct_20, fill = historical)) +
  geom_density(alpha = 0.75) +
  scale_fill_manual(values = c("blue", "purple", "red"))

# What does facet_wrap do?! It draws each category in its own panel

ggplot(elections, aes(x = repub_pct_20, fill = historical)) +
  geom_density() +
  scale_fill_manual(values = c("blue", "purple", "red")) +
  facet_wrap(~ historical)

# Let's try a similar grouping strategy with a histogram instead of density plot.
# Why is this terrible? There is so much going on, and it is all on top of each other.

ggplot(elections, aes(x = repub_pct_20, fill = historical)) +
  geom_histogram(color = "white") +
  scale_fill_manual(values = c("blue", "purple", "red"))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

15 Exercise 10

Personally, I prefer density plots separated by category for visualizing the relationship between a quantitative and a categorical variable. Compared with boxplots, density plots show more of the shape and variation within each group and how the groups compare to each other. However, density plots do not show the median or the quartiles as easily as boxplots do.

16 Exercise 11: Categorical vs Categorical - Intuition Check

# Plot 1: adjust this to recreate the top plot
ggplot(elections, aes(x = historical, fill = winner_20)) +
  geom_bar()

# Plot 2: adjust this to recreate the bottom plot
ggplot(elections, aes(x = winner_20)) +
  geom_bar() +
  facet_wrap(~ historical)

17 Exercise 12: Categorical vs Categorical

# A stacked bar plot
# How are the "historical" and "winner_20" variables mapped to the plot, i.e. what roles do they play? historical is mapped to the x-axis and acts as the predictor, while winner_20 is mapped to the fill color and acts as the response.

ggplot(elections, aes(x = historical, fill = winner_20)) +
  geom_bar()

# A faceted bar plot
ggplot(elections, aes(x = winner_20)) +
  geom_bar() +
  facet_wrap(~ historical)

# A side-by-side bar plot
# Note the new argument to geom_bar: position = "dodge" places the bars for each winner_20 value side by side within each historical category

ggplot(elections, aes(x = historical, fill = winner_20)) +
  geom_bar(position = "dodge")

# A proportional bar plot
# Note the new argument to geom_bar
ggplot(elections, aes(x = historical, fill = winner_20)) +
  geom_bar(position = "fill")
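
The proportional bar plot rescales each bar so that its segments sum to 1. The same within-category proportions can be computed as a table with `table()` and `prop.table()`. This is a minimal sketch on a hypothetical toy data frame; the rows in `toy_counties` are made up so the example runs on its own and do not reflect the real election results.

```r
# Hypothetical toy data, made up for illustration only
toy_counties <- data.frame(
  historical = c("blue", "blue", "blue", "purple", "purple", "red", "red", "red"),
  winner_20  = c("dem", "dem", "repub", "dem", "repub", "repub", "repub", "dem")
)

# Counts of each winner within each historical category
counts <- table(toy_counties$historical, toy_counties$winner_20)

# margin = 1 rescales each row (each historical category) to sum to 1
row_props <- prop.table(counts, margin = 1)
row_props
```

Running the same two steps on `elections$historical` and `elections$winner_20` gives the numbers behind the heights in the `position = "fill"` plot.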